Event Extraction: A Survey
Extracting the reported events from text is one of the key research themes in
natural language processing. This process includes several tasks such as event
detection, argument extraction, and role labeling. As one of the most important
topics in natural language processing and natural language understanding, event
extraction has applications spanning a wide range of domains such as
newswire, the biomedical domain, history and the humanities, and cyber security. This
report presents a comprehensive survey for event detection from textual
documents. In this report, we provide the task definition, the evaluation
method, as well as the benchmark datasets and a taxonomy of methodologies for
event extraction. We also present our vision of future research directions in
event detection. Comment: 20 pages
Exploiting the Matching Information in the Support Set for Few Shot Event Classification
The existing event classification (EC) work primarily focuses on the
traditional supervised learning setting, in which models are unable to extract
event mentions of new/unseen event types. Few-shot learning has not been
investigated in this area although it enables EC models to extend their
operation to unobserved event types. To fill this gap, in this work we
investigate event classification under the few-shot learning setting. We
propose a novel training method for this problem that extensively exploits the
support set during the training process of a few-shot learning model. In
particular, in addition to matching the query example with those in the support
set for training, we seek to further match the examples within the support set
themselves. This method provides more training signals for the models and can
be applied to any metric-learning-based few-shot learning method. Our extensive
experiments on two benchmark EC datasets show that the proposed method can
improve the best reported few-shot learning models by up to 10% in accuracy for
event classification. Comment: Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD) 202
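The intra-support matching idea can be sketched as a prototypical-network-style episode loss with an auxiliary term that also matches the support examples themselves against the class prototypes. This is a minimal illustration under assumed details (negative squared Euclidean distance as matching score, a simple weighted sum of the two terms), not the paper's exact objective:

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def proto_cross_entropy(embeddings, labels, prototypes, classes):
    """Cross-entropy of matching embeddings to class prototypes,
    using negative squared Euclidean distance as the logits."""
    d = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    probs = softmax(-d)
    idx = np.searchsorted(classes, labels)
    return float(-np.log(probs[np.arange(len(labels)), idx] + 1e-12).mean())

def episode_loss(support, s_labels, query, q_labels, aux_weight=0.5):
    """Standard query-to-support matching loss, plus an auxiliary term
    matching the support examples against the prototypes (the extra
    training signal described in the abstract)."""
    classes = np.unique(s_labels)
    prototypes = np.stack([support[s_labels == c].mean(0) for c in classes])
    main = proto_cross_entropy(query, q_labels, prototypes, classes)
    aux = proto_cross_entropy(support, s_labels, prototypes, classes)
    return main + aux_weight * aux
```

In an actual model the embeddings would come from a trainable encoder and the loss would be minimized by gradient descent; the auxiliary term simply reuses the same matching machinery on the support set.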
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
A key technology for the development of large language models (LLMs) is
instruction tuning, which helps align the models' responses with human
expectations to realize impressive learning abilities. The two major approaches
to instruction tuning are supervised fine-tuning (SFT) and reinforcement
learning from human feedback (RLHF), which are currently applied to produce the
best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for
research and development efforts, various instruction-tuned open-source LLMs,
such as Alpaca and Vicuna, have also been introduced recently.
However, existing open-source LLMs have only been instruction-tuned for English
and a few popular languages, hindering their impact and accessibility for
many other languages. Among the few very recent works that explore
instruction tuning for LLMs in multiple languages, SFT has been used as the
only approach. This has left a
significant gap for fine-tuned LLMs based on RLHF in diverse languages and
raised important questions on how RLHF can boost the performance of
multilingual instruction tuning. To overcome this issue, we present Okapi, the
first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages
to facilitate the experiments and development of future multilingual LLM
research. We also present benchmark datasets to enable the evaluation of
generative LLMs in multiple languages. Our experiments demonstrate the
advantages of RLHF for multilingual instruction over SFT for different base
models and datasets. Our framework and resources are released at
https://github.com/nlp-uoregon/Okapi
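Response-ranked data of the kind described above is typically used to train a reward model with a pairwise (Bradley-Terry) ranking loss before the RLHF policy-optimization step. The sketch below shows that generic loss, not Okapi's actual training code:

```python
import numpy as np

def reward_ranking_loss(chosen_rewards, rejected_rewards):
    """Pairwise Bradley-Terry ranking loss: -log sigmoid(r_chosen - r_rejected),
    averaged over pairs. It is lower when the reward model scores the
    human-preferred response higher than the rejected one."""
    diff = np.asarray(chosen_rewards, dtype=float) - np.asarray(rejected_rewards, dtype=float)
    # log(1 + exp(-diff)) equals -log(sigmoid(diff)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

The trained reward model then provides the scalar feedback signal that RLHF optimizes the instruction-tuned policy against.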
Boosting Punctuation Restoration with Data Generation and Reinforcement Learning
Punctuation restoration is an important task in automatic speech recognition
(ASR) which aims to restore the syntactic structure of generated ASR texts to
improve readability. While punctuated texts are abundant from written
documents, the discrepancy between written punctuated texts and ASR texts
limits the usability of written texts in training punctuation restoration
systems for ASR texts. This paper proposes a reinforcement learning method to
exploit in-topic written texts and recent advances in large pre-trained
generative language models to bridge this gap. The experiments show that our
method achieves state-of-the-art performance on the ASR test set on two
benchmark datasets for punctuation restoration. Comment: Accepted at INTERSPEECH 2023, 6 pages
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
The driving factors behind the development of large language models (LLMs)
with impressive learning capabilities are their colossal model sizes and
extensive training datasets. Along with the progress in natural language
processing, LLMs have been frequently made accessible to the public to foster
deeper investigation and applications. However, when it comes to training
datasets for these LLMs, especially the recent state-of-the-art models, they
are often not fully disclosed. Creating training data for high-performing LLMs
involves extensive cleaning and deduplication to ensure the necessary level of
quality. The lack of transparency for training data has thus hampered research
on attributing and addressing hallucination and bias issues in LLMs, hindering
replication efforts and further advancements in the community. These challenges
become even more pronounced in multilingual learning scenarios, where the
available multilingual text datasets are often inadequately collected and
cleaned. Consequently, there is a lack of open-source and readily usable
datasets to effectively train LLMs in multiple languages. To overcome this
issue, we present CulturaX, a substantial multilingual dataset with 6.3
trillion tokens in 167 languages, tailored for LLM development. Our dataset
undergoes meticulous cleaning and deduplication through a rigorous pipeline of
multiple stages to accomplish the best quality for model training, including
language identification, URL-based filtering, metric-based cleaning, document
refinement, and data deduplication. CulturaX is fully released to the public on
HuggingFace to facilitate research and advancements in multilingual LLMs:
https://huggingface.co/datasets/uonlp/CulturaX. Comment: Ongoing Work
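A multi-stage pipeline like the one listed above can be pictured as a chain of per-document filters followed by deduplication. The stage implementations below are simplified placeholders (the blocklist and word-count threshold are hypothetical, and production pipelines use near-duplicate detection such as MinHash rather than exact hashing):

```python
import hashlib

def url_filter(doc, blocklist=frozenset({"spam.example.com"})):
    """URL-based filtering: drop documents from blocklisted hosts.
    The blocklist here is a hypothetical placeholder."""
    return None if doc.get("url_host") in blocklist else doc

def metric_clean(doc, min_words=5):
    """Metric-based cleaning: drop documents failing simple quality
    statistics (here, just a minimum word count)."""
    return doc if len(doc["text"].split()) >= min_words else None

def clean_corpus(docs):
    """Run each document through the filter stages, then perform
    exact deduplication on a content hash."""
    seen, kept = set(), []
    for doc in docs:
        for stage in (url_filter, metric_clean):
            doc = stage(doc)
            if doc is None:
                break
        if doc is None:
            continue
        key = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Each stage either passes a document through (possibly transformed) or rejects it, so new stages such as language identification or document refinement slot into the same chain.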
A study on abstract meaning representation parsing
Supervisor: NGUYEN, Minh Le. Graduate School of Advanced Science and Technology, Master's degree (Information Science)